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Abstract 


The purpose of this study is to leverage modern technology (such as mobile or web apps in 
Beckman et al. ( 2Q14[ )) to enrich epidemiology data and infer the transmission of disease. 
Homogeneity related research on population level has been intensively studied in previous 
work. In contrast, we develop hierarchical Graph-Coupled Hidden Markov Models (hGCH- 
MMs) to simultaneously track the spread of infection in a small cell phone community 
and capture person-specific infection parameters by leveraging a link prior that incorpo¬ 
rates additional covariates. We also reexamine the model evolution of the hGCHMM from 
simple HMMs and LDA, elucidating additional flexibility and interpret ability. Due to the 
non-conjugacy of sparsely coupled HMMs, we design a new approximate distribution, al¬ 
lowing our approach to be more applicable to other application areas. Additionally, we 
investigate two common link functions, the beta-exponential prior and sigmoid function, 
both of which allow the development of a principled Bayesian hierarchical framework for 
disease transmission. The results of our model allow us to predict the probability of in¬ 
fection for each person on each day, and also to infer personal physical vulnerability and 
the relevant association with covariates. We demonstrate our approach experimentally on 
both simulation data and real epidemiological records. 

Keywords: Flu Diffusion, Social Networks, Dynamic Bayesian Modeling, Message Pass¬ 
ing, Heterogeneity. 


1. Introduction 


Disease diffusion modeling is an important topic in medical informatics Mould (2012). Re¬ 
cently, much emphasis has been placed on personalized medical treatment and advice, en¬ 
couraging prediction of disease on an individual level. A notable example is Apple Inc., 
who recently released the apple watch and built-in Health apps. However, the majority of 
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research on detecting patterns and discovering the risks of infectious disease has been at 
the population level, such as Google flu trend prediction. This paper aims to incorporate 
sensor and app collected data from mobile devices for individualized prediction of infectious 
disease risk and transmission. The data may contain, or allow us to learn, information 
including personal health condition, physical location and social contacts. 


From dynamic social network information, recorded via a smartphone, we can capture 
transmission paths at the individual level. The use of social networks provides a framework 
for identifying interactions and transmission and has been used to populate various complex 
systematic models and personalized risk models. However, the preponderance of available 
social network data relies primarily on reported network connections, resulting in a missing 
data problem and reducing the robustness of inferences that can be made. In this study, we 
sought to overcome these problems by utilizing a novel cell phone bluetooth network contact 
app to infer dynamic social network interactions, infection probability and transmission. 
Location and time based information from this app allow us to track personal daily contacts 
between participants. Proximity can be measured by the contact duration within a certain 
range. Similar social experiments have been conducted and mentioned in [Beckman et al. 


(2014); Dong et al. (2012, 2011), but these prior models assumed homogeneous individuals 
or global parameter sharing within the networks, and did not include data on potential 
modifying factors, such as personal health habits and demographic features of individuals 
in the network. 


1.1 Related Works 


Hidden Markov models are widely used in simulating the state space based disease progres¬ 
sion Ohlsson et al. (2001); Jackson et al. (2003); Sukkar et al. (2012); Dong et al. (2012), 


since HMMs actually mimic the behavior of an SIS model with discretized timestamps and 
are convenient in terms of the development of inference algorithm for different variants. In 


our work, we also adopt this basic framework to design our model. Other work like Zhou 


et al. (2012, 2013) proposed a group lasso formulation or multi-task learning framework us¬ 


ing optimization, and is far from our domain. A recent work by Wang et al. (2014) learned 


a continuous time series model looking at trajectories for different patients, but ignoring 
interactions between chains. Our work attempts to unify social network modeling of indi¬ 


viduals using a hierarchical structure. Christley et al. (2005) utilized a fixed social network 
analysis on susceptible-infectious-recovered (SIR) models to identify high-risk individuals. 


Salathe et al. (2010)’s work on close proximity interactions (CPIs) of dynamic social net¬ 


works at a high school indicated immunization strategies are more credible if extra contact 
data is provided. 

The idea of using hierarchies to improve model ffexibility is extensively studied in topic 
modeling, in models such as latent Dirichlet allocation (LDA) [Chang et aT (2010); Blei et al. 
(2010); Paisley et al. (2012). Chang et al. (2010) used a sigmoid link function, introduced in 


Relational Topic Model to learn fixed networks of documents. These, and further works have 
exemplified a trend in data-driven machine learning applications - hierarchical modeling is 
used to make inferences when the data structure is complex. Our work can be considered 


as a hierarchical extension of either GCHMMs Dong et al. (2012) or topic HMMs Gruber 
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Table 1: Notations 


Description 

n G [A^] 

index for participants 

te[T] 

index for timestamp 

se[S] 

index for observed symptoms 


covariates indicating personal features 

Gt-i 

dynamic social networks between t — 1 and t 

liln) 

recovery probability if infectious at previous timestamp 


probability of being infected from some one outside networks 

/3(A) 

probability of being infected from some one inside networks 

TT 

infection probability of initial hidden state Xn^i 

G A = { 0 , 1 } 

latent variable indicates whether infected 

yn,t e r = {0,1}^' 

reported symptoms during each time interval 

^Xn,t,S 

emission probability of symptom s onset given 


et al. (2007) with a nested transition function. From this viewpoint, we can also interpret 


our graphical model in terms of hierarchical LDA modeling. 


1.2 Notation 

Let N be the number of participants in the social community, and T be the days being 
tracked. The health record for each participant can be simulated as an HMM with T 
timestamps. Let y be the observation space of a Markov chain with hidden state space T, 
and initial probabilities tt. We also refer to the infection rate related parameters as 7 , a 
and /3. In particular, 7 gives the probability that a previously-infectious individual recovers 
and again becomes susceptible, a represents the probability that an infectious person from 
outside the community infects a previously-susceptible person within the community. /3 
represents the probability that an infectious person from the community infects a previously- 
susceptible person. These parameters are used to construct the transition probability of 
HMMs, whereas the emission probability 9x merely depends on the hidden state. 

Additionally, we introduce the notation associated with the mobile app survey or sensor 
logs. S is the number of symptoms in self report record. Temporal features in our survey 
are denoted by 2 ;. Since our discussion of HMMs varies between parameter sharing and 
the inhomogeneous setting, we temporarily did not include any subscripts to avoid ambi¬ 
guity. Clarification will be given in the subsequent sections. However, the overall notation 
description is concisely summarized in Table 

1.3 Network Data Imputation 

Our experiments are mainly conducted using both actively reported symptoms via mobile 
apps and passively detected location through bluetooth sensors. Occasionally, professional 
flu diagnoses will be given if symptoms imply a high risk of infection. Contact between any 
two agents can be inferred by the sensor collected location information. However, various 
uncertainties may cause the electronic data to be missing, such as dead batteries or a weak 
signal. This necessitates data preprocessing to recover the dynamic social network which 
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will be used for the next stage of our analysis. Relational modeling is not our current focus, 


so we use a nonparametric Bayesian dynamic network learning algorithm from Durante and 


Dunson (2014). 


Given Gt — {gij{t))NxN being a symmetric binary matrix to represent the dynamic 
social networks at timestamp t, let S(t) be the indicated symmetric probability matrix, i.e. 

Pr(^zj(t) = 1) = or equivalently gij{t) -- Bernoulli(^^j(t)). 


In Durante and Dunson ( |2014| ), they applied a [0,1] scaled transformation on a deliberately 
constructed latent space to obtain the time varying matrix S(t). A link function / : M ^ 
[0,1] or a prior can be imposed on the latent space. For simplicity, the commonly used 
sigmoid link function is introduced. 




where g{x) = (l + e In order to reduce the problem complexity and avoid modeling 

stochastic processes, the dynamic latent space can be expressed as 


Sij{t) = /i(t) + hi{t)~^hj{t), 


where /i(t) is a scalar indicating the baseline relation intensity, and h^(t) = hi 2 ^^ hinY 
is a vector in latent space with dimensionality H. This design also allows borrowing infor¬ 
mation to exploit the underlying process inducing similarities among the units. The prior 
imposed on the latent space variable is Gaussian process to capture the time dependency. 


h 

hihi-) ~ G'P{0,Tf^^CH),Th = J\9k,0i ~ Gamma(ai, l),(9fc -- Gamma(a 2 ,1), A: > 2 

k=i 


where ch is a squared exponential (SE) correlation function c/f (t, t') = exp (—||t — t'|p/2) 
with parameter and the prior of has a shrinkage effect. Similarly, the prior of baseline 
variables is /i(-) ^ 07^(O,Cy^) with SE function c^. 

This model allows missing data to appear in the observed since it can be easily 
sampled from the posterior of Ef. The latent space sampling greatly depends on a Polya- 
Gamma distribution, thus allowing the posterior computation to be performed with fully 
Bayesian inference. In this paper, we merely describe the generative process of their model, 
whereas the inference method involving Gibbs sampling using a data-augment at ion trick 


Poison et al. (2013) is beyond our scope and will be omitted. We refer the interested readers 


to these previous publications for details on the methodology. 


1.4 Contributions of the Paper 

We review the homogenous Graph-coupled HMM based ffu infection model within both the 
frameworks of Bayesian statistics and factor graph theory, and analyze various connections, 
for example, to the prominent work of Dong et al. (2012). We also extend this model based 
on covariates, introducing associated parameter heterogeneity, and apply it to personalized 
health data given via wearable equipment. To the best of our knowledge, both the model 
extension and the application have never been investigated in previous work. We also 
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provide a description of the relationship to topic modeling, showing that our model can be 
derived from standard LDA or HMMs. Using two available real-world datasets, we discover 
several reasonable and interesting phenomena from our experimental evaluation, that lead 
to intuitive interpretations of our approach. The practical results offer new possibilities for 
applications in future personal health research. Specifically, this paper 


(I) designs a more accurate approximation for the distribution of auxiliary variables 
to track infection source, especially when the values of g,/ 3,7 violate the near-zero 
assumption that only holds in the domain of epidemiology, allowing our model to 
potentially fit other application areas. 

(H) generalizes the standard Baum-Welch algorithm to GCHMMs under the constraint 
of sparse networks. The sparsity assumption imposed on daily social networks is 
reasonable in any real-world setting. 

(HI) characterizes person-specific infection parameters by imposing a covariate dependent 
hierarchical structure on GCHMMs. The link function is associated with covariates 
relates to temporal personal features, such as gender, weight, hygiene habits, and 
diet habits. We also adopt the hypothesis that better personal habits should result 
in lower susceptibility to influenza. 

(IV) explores both deterministic and probabilistic link functions in our model. In partic¬ 
ular, the sigmoid link function and beta-exponential prior can be generalized to the 
softmax and Dirichlet-exponential respectively. 


(V) develops an efficient parameter estimation algorithm for the non-conjugate prior. 
Inspired by Delyon et al. (1999), our proposed solver performs Gibbs sampling and 
optimization iteratively. A faster version of our EM-like algorithm for binary hidden 
variables is primarily used for our experimental tests, which significantly accelerates 
the computational speed without a significant impact on accuracy. 


1.5 Organization of the Rest of the Paper 


In Section we describe the basic idea of GCHMMs and discussion the Gibbs sampling 
and message passing methods used, where a connection and comparison will be presented 
through an illustrative example. Meanwhile, a new decomposition trick for the auxiliary 


variable distribution is visually compared with previous work Dong et al. (2012). In Section 


we test our proposed hGCHMMs by extending the GCHMMs to be both heterogeneous 
and hierarchical, explicitly discussing the two different link schemes and relevant parameter 
interpretation. A new model evolution from LDA to hGCHMMs is also depicted in this 
context. In Section]^ we focus on how to modify the EM algorithm to a Gibbs sampling 
version when the expectation is intractable with respect to hidden or latent variable. In 
Sectionwe report empirical results, applying it to semi-synthetic and real-world datasets. 
Conclusions and future research directions are discussed in Section 
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Figure 1: Left, top, an HMM graphical model. Left, bottom, a dynamic social network, 
Right, illustration of the formation of GCHMMs involving 3 people. 1 and 3 have 
social contact between t — 1 and t. The infection states of 1 and 3 at t are then 
both influenced by each others’ infection states at time t — 1. 


2. Graph-coupled Hidden Markov Model 


We first briefly introduce the graph-coupled hidden Markov model (GCHMM), evolving 
from coupled hidden Markov model (CHMM) [Brand et al. ( 1997| ), a dynamic represen¬ 
tation for analyzing the discrete-time series data by considering the interactions between 
Markov chains (see Figure for an example, where filled nodes are observed). The stan¬ 
dard CHMM is typically fully connected, between hidden nodes at successive timestamps, 
whereas the intrinsic sparsity of a dynamic social network can couple multiple HMMs with 
the possibility for fast inference. The number of parameters needed to be inferred will de¬ 
crease dramatically from 0{N^) to where n^ax is the maximum degree of hidden 

nodes. This advantage will further benefit our message passing algorithm and hGCHMMs 
in the subsequent section. 


2.1 Generative Modeling 

Let Gt — (A^, Et) be a network structure snapshot between timestamps t — 1 and t, where 
each agent or participant is represented by a node n ^ J\f = [ivQ in graph Gt, and Et is 
a set of undirected edges in Gt, where unordered pair {rti^rij) G Et if two participants ni 
and rij have a valid contact during the time interval (t — l,t]. The bottom in Figure a) 
illustrates an example of the dynamic social networks. Assuming that each participant n is 
represented a HMM with binary hidden state Xn^t shown as the top in Figure j^a), where 

1. [N] means a set including integers from 1 to A. 
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0 and 1 indicate susceptible and infectious respectively. The observed node is an S 
dimensional binary vector 2 /n,t ,27 • • •, 2 /n,t,S') an indicator for S symptoms. Thus, 

the generative model of GCHMMs is given in a fully bayesian way. 


TT ^ Beta(a7T, b^) 

a ^ Beta(ac^, ba) /3 Beta(a^, 6 ^) 7 ^ Beta(a^, b^) 

Oq^s ^ Beta(ao, bo) 9i^s ^ Beta(ai, 61 ) 
Xn,o ^ Bernoulli( 7 r) 

Xn,t ~ Bernoulli {(f>n,x^r.{n,n')€Gti(^, Ph)) 
yn,t,s ~ Bernoulli(6'2,„,t,s) 


where the transition probability (l>n,x^,:{n,n')eGt{^^ a function of the infection param¬ 

eters and the dynamic graph structure. This homogenous setting means all HMMs share 
the same parameters set or similar transition function. The difference is reffected as Figure 
indicated, the transition of the hidden state is not only dependent on the previous state 
of its own Markov chain but also may be inffuenced by states from other HMMs that have 
edges connected to it. One undirected edge in Gt indicates a valid contact in time interval 
(t — l,t], thus leading to a directed edge in GCHMMs. Recalling the definition in terms of 
7 , G, [3 (See Table Q, it is natural to construct the transition probability function as follows: 


:{n,n')eGt (T5 ^5 /^) 


7 

1-7 

^ (1-a)(l 


— O 5 

~ ~ I 5 

Xn,t — — I 5 

Xn,t — = 0 . 


( 2 ) 


where is the indicator function and Cn^t = J2n'-{n' n)eEt t=i} count of possible 

infectious sources for node n in in other words, it means that besides participant n, the 
number of other infectious nodes that have social contacts with n at the previous day. The 
epidemiological intuition is very simple, the more infectious people one comes in contact 
with, the more probable one is to be contaminated. This Bayesian formulation of GCHMMs 
can be applied to fit homogenous susceptible-infectious-susceptible (SIS) epidemic dynam¬ 
ics. Additionally, the Bayesian inference of this model (e.g. Dong et al. (2012)) can be 
reduced to a special case from our heterogenous model by getting rid of link hierarchy and 
personalization of infection parameters. We do not go into details here, but we generalize 
the Baum-Welch algorithm to sparse-coupled HMMs and relax the assumption of near-zero 
parameters, i.e. g, (3 and 7 are all 0. 


2.2 Generalized Baum-Welch Algorithm 

HMMs usually model independent sequenced or discrete time-series data, and represent 
long-range dependencies between observations for each data point, mediated via latent 
variables. We are inspired by the fact (Loeliger ( |2004[ )) that the forward-backward, Viterbi 
and EM for HMMs (see Murphy (2012)’s review) learning algorithms have their equivalent 
message passing formulation in factor graph representation. The independency property 
allows the three algorithms to be efficient and effective, while the blownup parameter space 
of standard coupled HMMs makes analogous treatment impossible and impractical. Thus 
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Figure 2: Factor Graph of Figure [^b) 


different assumptions have been imposed on CHMM, such as Zhong and Ghosh (2001). 
Fortunately, in our sparse coupling setting, a Generalized Baum-Welch Algorithm can be 
developed via a similar factor graph representation and enables efficient forward-backward, 
Viterbi and approximate EM algorithms. 

First, we describe the factor graph in Figure derived from the Bayesian network in 
Figurej^b). Commonly, factor graphs contain two types of nodes, variable nodes and factor 
nodes. In our model, the factor nodes can be categorized as emission probability functions 
and transition probability functions, 

^n,t,y\x,s{yn,t,s\Xn,t) — p{yn,t,s\Xn,t) — ~ (^) 

where the second equation defines the same function (§. Notice that the factor graph is 
undirected and direction information is comprised in the factor node. The factor graph is 
still shown with plates; each plate represents a participant while the interaction is captured 
by transition factor. In fact, during the running of the EM algorithm, all parameters are 
unknown, so (a,/3,7 can go at the top of the current graph, adding edges to all transition 
factors. For simplicity, we did not include them in our factor graph, though this widely 
used trick is introduced in independent HMMs (Loeliger ( 2004[ )). 


2.3 Forward-backward Algorithm 

Notice that even if the Bayesian network is not exactly a directed polytree (cyclic paths exist 
if omitting edge direction), marginal inference on each hidden node can still be approximated 
with belief propagation on the factor graph for the sake of efficiency. In this section, we 
will derive the single node belief propagation rules (Pearl ( 2014[ )) on Figure]^ Denote the 
message passing to the child, i.e. from Xm,t-i to Xn^t as and the message to 
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parent, i.e. from Xk^t+i fo Xn^t Xxk Then, the belief or probability distribution 

passed with all evidence shown is denoted as BEL(T^^t) = P{xn^t\Y)‘ Our derivation is 
based on Pearl’s belief propagation algorithm. All oc in the following imply the term should 
be normalized to 1 as a valid probability distribution. For notation simplicity, we further 
denote = pixn,t\x.,t-i) and Xx.^t+ii^n,t) = pix.,t+i\xn,t)- The principal part of 

forward-backward algorithm can be summarized as follows, where the detailed update for 
A and tt is derived in Appendix A. 

T^^'\xn,t)= F H ^xlt(X;t-l) (5) 

nU{n':{n' ,n)eEt-i} 

S 

( 6 ) 

5=1 nU{n':{n,n')eEt} 

BELW(a;„,t) oc Tr^'^Hxn,t)X^"\xn,t) (7) 

Even if it has been proven that belief propagation on standard HMMs is equivalent to the 
forward-backward algorithm, the update step for message passing in our case is a little 
more complicated since the single node dependence is generalized to multi-nodes. Let the 
maximum degree of all GfS be M, then the update of sum-product for message can be 
computed with complexity 0(2^) at each iteration. This is the reason why the sparsity 
assumption is required in our algorithm. In addition, the initialization for all messages from 
variable nodes to factor nodes, such as TTa,. (x.), (x.), can be set to all Is. 

Viterbi algorithm can be derived in a straightforward way, if the sum-product fl 
forward-backward is substitute by max-product max]^. Since the message updating step 
is almost the same, it won’t be discussed here. 

2.4 Approximate EM Algorithm 

In this section, we will put forward a parameter learning scheme via the generalized Baum- 
Welch Algorithm. It is straightforward to derive the expected complete data log-likelihood 
given by 

N r T T s \ 

Q{Q^Qold^ = EE olog 7 r + (1 - a;n,o)log(l - tt) + ^\og + 'P^'P^og4'n,t,y\x,s > Pr(X|Y, 

X n=l I t=l t=l s=l J 

( 8 ) 

This is exactly the E-step in EM algorithm, and the non-approximate M-step can be opti¬ 
mized for parameters tt, Ox and 7 . In fact, due to the conjugacy of these parameters, their 
posterior distribution can also be analytically computed. Taking the partial derivative on 
Q, i.e. 


dQ(e, _ dQ(e, _ dQ(e, 0°'^) _ dQ(e, 0°'^) 


(9) 
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By solving above equations, we obtain the update formula for corresponding parameters. 


EliEKo] 

N 

Efal y(x| Y. e°“) 

(Et, EL p(X|Y, 


( 10 ) 


Notice that it does not matter if we change the p(X|Y,©^^^) to p(X,Y|©^^^), and 
^ ^ ~ ® result in an analogous computation as for 7 for the 

iteration. However, except for tt, the exact computational complexity of the iteration step 
for other parameters is intractable, exponentially increasing with N or Since we did not 
assume near-zero parameters, the induced non-conjugacy requires further approximation in 
the M-step. If we approximate P(X|Y,©^^^) = Hn t ^ factorized 

form, then all the optimized results for 6 would update analytically, because |Y, ©^^^) 

(i.e. BFjL{xn^t)) can be computed by the forward-backward algorithm as derived before. 


_ En=l J2t=l yn,t,sHh=o{^ - Xn,t) + h=lXn,t] . _ 


Oi 


En=l Et=l ]E[Ij=o(l Xn^t) + ^i=lXn,i\ 


0,1 


( 11 ) 


Updating 7,(a,/3 is a little tricky. First we introduce the approximation for 7 , which 
will make the other two updates more understandable. Even if we use full factorization in 
the approximation, the update of 7 is associated with variables Xn^t-i and Xn^t- A natural 
idea is to use Monte Carlo methods to sample ^ from p(X|Y, 0°^'^) = 

n^ Then we can count the number of times that event Xn,t-i = 1, Xn,t = 0 

happens. For simplicity, we can directly assign the simulated sample by Bayesian decision 
strategy according to each |Y, ©^^^) instead of sampling. That is to say, we only need 

to set the sample = argmax^^ ©^^^) and give the following result. 


7 ^ 


Z^n=l 2^t= 


1 ■^:rn,t-i=l,^n,t=0 


Z^n=l 2^t= 


1 


EtiE;.iE[a,„,,_i] 


J2n=l 


( 12 ) 


The same trick can be applied to update (a,/3. However, along with this approximation 
trick, we also need variable substitution. Let Tj = (1 —(a)(l —then a = 1—tq. Therefore 
the update of a and r^, i = 1 ,..., M is analogous to 7 . 


a 


Ti = 


E7i Yh=1 fe„,t_l=0,a:n,t=lllEn':(n.»')eSt_i 


E 


N 

n=l 


En=l Et=l Xn^t-l] 

Et=l ^^n,t-l=0,a:n,t=olEn':()i,n')SEt_l ^n' 

e7iELiE[i-x.,_i] 


(13) 

(14) 
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Data: Y, G 

Result: parameter set © = {tt, 0, o,/3, 7 } and hidden matrix X 
Initialize coefficient parameter ©(^); 

repeat 

Run one iteration forward-backward algorithm on ©^^^ to obtain 
Update ©^^^ based on equations from (10) to (15); 
until ©^^'^ Convergence] 

Algorithm 1: Generalized Baum-Welch Algorithm 


Performance of Gibbs Sampling 



1 

0.95 

0.9 

0.85 



0.75 

0.7 

0.65 

0.6 


Performance of Belief Propagation 



— ^ with para 
— ^ w/o para 


1 23456789 10 

Iterations 


Figure 3: The axis of Gibbs sampling is scaled by logarithm. Both algorithms can achieve 
an accuracy above 98.5%. 


/3 = 1 — j ^ 5 thus meaning we have M estimations for /3s. How to combine these /3s to 
obtain a better estimation may vary for different applications. We suggest one possibility 
using averaging that can be adjusted under various circumstances. In summary, we organize 
our generalized Baum-Welch Algorithm for GCHMMs in the following framework 


M 


Ti 


1 — a 


(15) 


2.5 A Simulation Study 

Synthesized data based on Real Social Networks 


We leveraged a dynamic social 


network dataset with 84 nodes over 107 days in Madan et al. (2012) denoted as G 84 x 84 xi 075 
we modified it to make its maximum degree be bounded by constant M = 11. Based 
on the generative model, we simulated infection states X84xi08 (including X.^q without 
emission) and observed symptoms data Y84xi07x6- We ran inference using two algorithms 
Generalized BW (GBW) and the Gibbs sampling developed by Dong et al. (2012) in two 
different settings, when parameters are known and unknown. Notice that GBW with known 
parameters is reduced to forward-backward belief propagation. Both algorithms won’t take 
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X as arguments but predict an X. Gibbs sampling and GBW are run for 500 and 15 
iterations respectively. The Gibbs sampling implemented in the experiments is burns-in 
half of the total iterations. Since X is a binary matrix, a threshold of 0.5 is used for 
prediction. 

In order to observe the performance with respect to the number of iterations, we con¬ 
duct experiments testing the predictive accuracy of X with results shown in Figure]^ With 
parameters known, both algorithms can achieve good performance within a few iterations, 
while in the unknown case, the generalized BW algorithm gives better performance than 
Gibbs sampling over fewer iterations. However, the excellent performance of GBW is depen¬ 
dent on the initialization of parameters. If they are chosen inappropriately, GBW may not 
perform well, a common disadvantage of EM algorithm. In addition, with more iterations, 
Gibbs sampling will give more robust prediction results. 

3. Extending GCHMMs to Hierarchies 

Though we successfully design a generalized EM algorithm for GCHMMs, we still overlook 
temporal covariates z^. In epidemics, it is commonly assumed that personal health features 
(covariates) are relevant to influenza vulnerability. In our model, G 7^^, where K is 
the dimension of the covariate feature space. Without loss of generalization, the feature 
is a constant of I in the feature space we are looking at. This relevance is captured by 
the mapping / : TZ^ [ 0 , 1 ] or the transformation from the feature space to infection 
parameters. In this section, we propose two constructions using two different link functions. 
A natural way to go is to extend the beta prior of the standard GCHMM to a beta- 
exponential link. 

Beta-Exponential link 


»7-,-~ •V(^, S) (16) 

7„ ~ Beta(exp(z)[77r,i),exp(z([77^,2)) 
an Beta(exp(z)[)77a,i),exp(z)[)77<j,2)) 

I5n ~ Beta(exp(z)[77fe,i),exp(z)[)7772)) 

where r]. . is distributed as a multivariate Gaussian, playing the role of the regression co¬ 
efficients, since the expectation - y ^ _ —- can be considered as an approximation for 

V ^',1 ^', 2 / 

logistic regression with coefficients — — 'n-,2)- This link also enables the exponential 

term exp(z^? 7 .^.) to take the place of the hyper-parameter of the beta prior. The usual 
count update to the hyper-parameter will implicitly update r] via our EM algorithm. 

Once 7 , a, /3 are allowed to be indexed by n, a new transmission function merely 
needs an index modification of the arguments in Q, but otherwise remains the same, 
i.e. t^n)- The advantage of this setting is that it allows for the ap¬ 

proximate Gibbs sampling of infection parameters in a way that still holds for GCHMMs, 
except for 77 , so that the original Gibbs sampling scheme to update the beta distribution by 
event counts is the same in the later E-step. Another advantage is, when X is generalized to 
categorical variables, a similar construction also works. We use an individual level distribu¬ 
tion with the transition but not with the emission matrix because it makes more sense that 
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everyone has the same probability of a physical behavior given an infection state. Patients 
should have corresponding symptoms, such as cough or throat pain, or the flu cannot be 
discovered or diagnosed. Instead of the Beta-Exponential distribution, we can introduce a 
deterministic link. 


Sigmoid link 


T]. ~ E), 7„ = a{zlr]r), an = Pn = (^i'^nVb) 


(17) 


In this generative process, less r/s are present, thus leading to a simpler model. Instead of 
sampling, the equation ( [T7| ) will actually make infection parameters vanish in the model. In 
other words, <?!>n,a:„,:(n,n')eGt( 7 n,^n) IS replaced by From an 

implementation perspective, the EM derivation will be easier, and the experimental results 
imply its performance is more competitive. Additionally, for both link constructions, the 
generative model 0 is also individually indexed by n. 

Another interpretation of I3n In above two extensions, it is implicitly assumed that 
f^n means the individual infection probability from another person within the network, is 


as given in Equation (18). From a biological perspective, the contagiousness of the infected 
person varies, meaning that I3n can be interpreted as the probability of spreading illness to 
any other person in the social network. This heterogenous inconsistency will not appear 
in the previously discussed homogenous setting. However, the second interpretation results 
in a slightly complicated mathematical calculation (details in inference section), since both 
the total count of infectious contacts Cn^t ^^nd their individual features are required. Thus, 
the probability of infection has two different definitions. 


P{Xn,t+l = = 0 ) = 1 - (1 - a„)(l - 

P{Xn,t+l = l\Xn,t = P) = I - {I - Oln) H ~ Pn 

n'eSn,t 


(18) 

(19) 


where the node set of infectious contacts is defined as Sn^t — ^ W] • (^5 '^0 ^ = l}* 


3.1 Model Evolution 

HMM Along the direction of graphical representation in Figure [TJ the hierarchical GCH- 
MMs shown in Figurecan be seen as a two-step evolution. First, each HMM is allowed to 
contain its own infection parameters, thus evolving to a heterogenous GCHMMs. Second, 
the link introduced previously can be used to associate with covariates. This scheme also 
inspires another corresponding two-separated-segment training: Gibbs sampling to estimate 
the posterior mean of ( 7 ^^, an^ Pn) for heterogenous GCHMMs, and fitting a standard logistic 
regression between the posterior mean and covariates. 

However, there are some things that we need to notice. ( 1 ) The beta-exponential genera¬ 
tive model is not well defined for unique dataset simulation, because we need only one sample 
from this generative model, which actually uses a set of fixed parameters (g^,/ 3^, to 

simulate X^vxT and Y^xTxS- Take 7 ^^ for example, 7 ^ ^ since 7 ^ is one realization 

of the generative model. What our algorithm aims to learn is a generative distribution 
Beta(e^’^^"’’b with expectation equal to 7^, rather than the real E[77i]. ( 2 ) There¬ 

fore, in experiments our simulation dataset is always generated from the sigmoid link model. 
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Figure 4: The hGCHMM extended from Figure [^b): for plotting simplification, the edges 
from 9x to other observed nodes are blanked out; the two step evolution, 
GCHMMs^heterogenous GCHMMs^hierarchical GCHMMs. 



Figure 5: (a) is the commonly represented LDA model; (b) is an equivalent but different 
graphical representation of LDA; (c) is our proposed model, adding topic depen¬ 
dency, document dependency and document-specific features. 


However, it is reasonable to use the beta-exponential link model for inference to eliminate 
the inconsistency. It has been mentioned previously the expectation of the beta-exponential 
is basically an approximate logistic regression. An EM like algorithm can perform point es¬ 
timation well for this expectation, which in turn would be an estimator of the sigmoid link. 
(3) Another way to make the beta-exponential generative and the inference process work 
is to sample both individually and dynamically, i.e. A number of 

samples are sufficient to learn the true generative distribution, though the new inference 
algorithm will necessarily become more difficult. 

LDA As shown in Figure [^b), LDA can be trivially represented as an unrolled graph 
for T documents. If we impose a Markov dependency to model topics changing, an em¬ 
bedded HMM appears with respect to the latent topics. Furthermore, we construct the 
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topic changing function relying on the document relationship and a link associated with 
document-specific covariates, thus resulting our hGCHMMs. A similar variant is LDA- 
HMM Griffiths et ah (2004), which requires extra hidden nodes essentially to model syntac¬ 
tic or function words, such as ’’and” or ’’however”, introducing a sentence-level dependency, 
while the HMM imposed on topic nodes gives a word-level dependency. 


4. Inference for hGCHMMs 
4.1 Approximate Conjugacy 

The inference process is designed to invert the generative model and to discover the r] and X 
that best explain G and Y. In our hierarchical extension, however, a fully conjugate prior 
is not present and knowing what the right prior is can be difficult. Thus an approximate 
conjugacy is developed by introducing the auxiliary variable Rn,t^ representing the non¬ 
specific infection source (inside or outside networks). The idea is to decompose infection 
probability In^t = 1 — (1 — Gn)(l ~ into the summation of three terms, an{l — 

(1 — Gri)(l ~ (1 ~ ~ (1 ~ indicating infection from outside, inside 

and both respectively, thus following a categorical distribution: 

if outside infection, = 1 
if inside infection, R^ t = 2 (20) 

if both, Rn^t — 3, 


P{Rn^t) = 


(l-an)(l-(l-/3n)^^-^ 


The exact expression still does not have Beta-Bernoulli conjugacy except for the case where 
Rn^t — 1- However, using a taylor expansion we have P{Rn^t — 2)P(x^^t+i = l\xn^t = 0) 
Gn,t(l-Gn)/3n and P{Rn,t = 3)P(x^y+i = = 0) Cn^tc^nl^n- The two approximations 

have the property that local full conditionals can be analytically obtained by discarding r] 
temporarily. In practice the term involving P{Rn^t — 3) can be approximated as 0 for Gibbs 
sampling. Because of the biological application, an and (3n are both a positive real value 
close to 0, resulting in their product being quite small. Even if this probability is taken 
into consideration in Gibbs sampling, there is a very small chance that Rn^t — 3. This 
approximation allows the posterior distribution of an^ Pn fo be much easier to compute 
given the current value of r]. 


In addition, our approximation works better than the proposed decomposition in |Dong| 

+ Cn.tl^n- In Figure]^ a quantitative comparison between our 


et al. (2012), i.e. 


n,t 


- 


proposed approximation and previous work indicates less error achieved by our decompo¬ 
sition. Specifically, if Cn^t — 1? onr decomposition recovers I exactly (blue line and red 
line are overlapped); with Cn,t increasing, both of the two approximations are biased, but 
our approach has constant error regardless of varying g, and shows less error for relatively 
bigger /3; moreover, the induced approximate distribution through our three terms is almost 
in line with the true distribution (20). The main advantage of the novel approximation is 
the possibility of deriving the fully conjugate posterior. Specifically, we have the following 
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(a) Approximate I w.r.t a 


(b) Approximate I w.r.t /3 (c) Approximate p{R) w.r.t a 


Figure 6 : Comparison between different approximation methods 


posteriors, which will benefit from the EM algorithm described later: 

an ^ Beta Cn^Rn=i^ 2 >-> + Cn^R ^=2 + 

I3n ^ Beta Cn^R^=2,3^ + Cn,R^2,3^ 

-in -- Beta(e^^^"’i + + Cn.i^i) 

where the count notations are defined as follows. 


Cn.i^j = ^{xn^t=^i 



Cn,Rn = l,3 = 'Et^Rr. 

,,t = l,3| ~ Cn,Rn = l = EAjRn, 

t=l} 

C'n,i?„=2,3 = Yjt 

,.*=2,3} ~ C'n,i?„=2 = EA{-Rn. 

*=2} 


= +I{xn,t=0,x„,t+1=0}. 



Note that the auxiliary variable did not appear in the posterior of 7 ^, which can be 
exactly computed due to conjugacy. Utilizing these approximate posteriors, the complete 
likelihood P(X,R, r/jZ) is obtained by integrating out the infection parameters. 


/ 


P(R|X, /3 ^'Li)F(X|7^'Li, /3 r=i)^(7r=i, <=i, |r?, 

+ Cn,1^0, + Cn,R^=l, 3 , e-nria,2 + Cn,R ^=2 + Cnfl^o) 


^piv) n 




B [e^nVa,l^eZ:^Va,2'j 


B{e^nVb,l -I- (7n,R„=2,3; + Cn,R^2,3) 


B^e^nVbA 


( 21 ) 


where R(-) is the beta function, and P{r]) is a multinomial Gaussian distribution. The 
integral result enables the analytical computation of the gradient and 2 ^^ derivative of 
the log-likelihood, used in optimization via Newton’s method. Under such f3n interpretation, 
the derivatives of the log likelihood ( 21 ) are straightforward. 
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4.2 Gibbs Sampling for the Conjugate Part 

Sampling Infection States Given the generative model implies a conjugate 

prior for Xn^t- The unnormalized posterior probability of Xn^t — can be represented as 

pL’* = O’l- 


pI 


t QC 7n - a^)Un,t-l=0.-n,t+l=0}(l _ (22) 


pit ^ 7n • (1 - - 4,3)'^-*-=° (23) 


where the normalized posterior of p{xnt = 1) is q . We need to be careful of the 

’ Pn,t^Pn,l 

boundary condition since Xn^i and Xn^r do not have this form. Xn^i is generated by Xn^i ^ 
Bernoulli(7r), where tt ^ Beta(a7T, 67,-). The full conditional depends on the initial event 
occurrence rate tt, further requiring some mild modification. The full conditional of tt can 
be efficiently derived. 


7r|X ~ Beta ( b^ + N -Y^ I(a:„,i=i) 

\ n n 

For the state Xn^T^ the posterior is easily computed since terms associated with t + 1 cancel 
out immediately. 

Sampling Missing Observations For real world data, a missing value problem com¬ 
monly arises because of underreporting in data collection. Bayesian schemes can successfully 
fill in missing values by drawing yn,t,s according to the distribution Bernoulli(0^^ ^^5), if they 
are NA. Given yn,t,sj the posterior of Oi^s^ {i — 0,1) is from a beta distribution. 


Y ^ Beta 


( ttj + ^. 
\ n,t 


■{yn,t,3=l,a:n,t=i}’^» + T^fan.t.«=0,a;n.«=O 


n.t 


4.3 Burn-in Gibbs EM Algorithm 


Previous works on CHMMs or GCHMMs seldom include the parameter 77, let alone a 
sampling scheme for inference. In hGCHMMs, the Gaussian prior makes the posterior of 
T] not conjugate. One possible solution is the Metropolis Hastings (MH) algorithm due 


to the approximate likelihood (21); however, the transition kernel is difficult to choose for 


MH, and running large numbers of iterations is usually required to achieve good mixing. 
Another thing to try may be an augmentation trick based on the Poyla-Gamma distribution 


Poison et al. (2013), which is mentioned for the network imputation in the introduction. 


The drawback of this scheme is that it can be straightforward to fit the sigmoid link, while 
the beta-exponential prior may need further generalization. Variational Bayesian inference 


(see Beal (2003) for a detailed introduction) is commonly used for approximate inference 
by optimizing a lower bound. If the readers are familiar with variational methods, it is 
obvious that for the conjugate-exponential family the update for parameters (in our model. 
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^5 ^x^l) can be written out analytically, equivalent to the posterior derivation. However, 
for other parameters associatiated with non-conjugacy, a gradient based method is the first 
option to explore, such as stochastic variational inference (SVI, Hoffman et al.| ( 2013D ), 
unless another lower bound with respect to previous lower bound can be found (e.g. as an 
illustrative example in Figure |^c), — 2) is approximated by lower bound but — 1) 

by upper bound. Strictly speaking, this approximation cannot be used as the lower bound 
of lower bound). 

In this section, we propose a fast algorithm based on expectation-maximization. In 
hGCHMMs, expected sufficient statistics are computationally intractable since there is no 
closed form for integrating out the latent variables. Stochastic Approximation (SA) or 


Monte Carlo (MC) EM by Delyon et al. (1999); Wei and Tanner (1990) is an alternative 
introduced to simulate the expectation, and it is able to obtain convergence to a local 
minimum with a theoretical guarantee under mild conditions. The basic idea is to use a 
Monte Carlo sampling approximation; however, we replace this step with Gibbs sampling 
by utilizing the approximate conjugacy property. 

E-step: Sampling and follows 


ttn, A,7n|Z,77(^ 

^ I 7 l^n 7 Kn 7 Y 

R|X, ^727 ^ 


(24) 


The true expectation or intractable integration is approximately calculated by a 

stochastic averaging in a burn-in representation defined as (26), taking advantage 

of Gibbs sampling. During each E-step, infection parameters are in fact always updated 
at each inner iteration of Gibbs Sampling, thus making the latent variables X, R update 
based on different posterior distributions at each sampling, which disagrees with SAEM or 
MCEM, sampling latent variable from a fixed distribution based on estimated parameter by 
previous M-step. Therefore the samples at later Gibbs sampling iterations are closer to the 
true posterior given current From this perspective, the Gibbs sampling in E-step 

may essentially accelerate the convergence rate in the next maximization step. 

M-step: Maximizing with respect to r/, i.e. argmaxQ^^^(r/). However, directly opti¬ 
mizing will suffer from the same drawback as in standard EM. Pathological surfaces 

of the log-likelihood may be present via saddle points and local optima, meaning that the 
algorithm is sensitive to initialization. [Delyon et al.| ( [1999 ) argued that the augmented 
objective function = (1 — can avoid this problem partially, 

where usually takes few samples to introduce a stochastic property, and 5^^^ 

small positive step size, essentially requiring the conditions in 


IS a 


hm = 0, lim = 1, 

k^oo ^ 


OC 


k^oo 


(25) 


The intuition to solve this intractable objective lies in jCelenx et al.j ( |1995D , showing that 
this optimization can be updated by = (1 —where 

is the true EM result approximated by MCEM with large sampling size, and Vsem 
special case of MCEM in very few samples or even unique one sometimes. 
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Data: Z, Y, G, sampling size J, burn-in iteration S, step size series 

Result: r] and X 

Initialize coefficient parameter 77 ^^^; 

repeat 

for i ^ 1 to J do 


sampling according to (24); 


end 

Compute 


QbGEM(^) ~ 


J-B 


j 

Y, log(p(X(^tR(^tr/|Z,r/ 


(fc-i) 


j=B+l 


einO) = log 


Optimization 

^bGEM ~ ^^S^^^QbGEMV^) 
^SEM = arg max {r ]); 

Combination = (1 - <5l^l)»?bGEM + 

until Convergence] 


(26) 

(27) 


Algorithm 2 : burn-in Gibbs EM Algorithm 


Generalizing this scheme to the Gibbs sampling setting, we formalize Algorithmic where 
QbGEM takes the sample average of the Gibbs algorithm and Qsem takes the last sample. 
OsEM is a stochastic perturbation of EM, and is expected to search more stable points. The 
algorithm starts by optimizing Qsem with = 1, making the search area large for the 
first few steps. Then it focuses more weight on optimizing QbGEM- A theoretical guarantee 
for this algorithm can be illustrated by using two convergence bounds; Birkhoff Ergodic 
theory Durrett (2010) and Theorem 7 in Delyon et al. (1999). 


Faster version for binary latent variables Because taking the first order derivative 
with respect to r] and setting it equal to 0 will obtain a non-analytical root, gradient descent 
based optimization is necessary, and we adopt Newton’s Method by taking the advantage of 
curvature information. Taking the inverse of the Hessian matrix usually requires algorith¬ 
mic complexity 0{K^). The dimensionality of r/ is A which is independent of HMMs scale 
A, and a PGA preprocessing will reduce it significantly, where the necessity is illustrated in 
experiments section. Though K representing the number of temporal health feature is less 
scalable in most application, there may still be a high cost to computing the Hessian with 
0{JK‘^) complexity due to matrix addition, unless there is a parallelized implementation 


with reduce operation 

Dean and Ghemawat 

(2008 

). Thus we need to seek for more efficient 

algorithm. For Gaussian variable. Price ( 
ablity of the derivatives and expectations. 

1958) prove a theore 
Rezende et al. (2014 

m to address the exchange- 
) implemented this idea in a 


non-Gaussian posterior likelihood and obtained good performance by approximating expec- 
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tation with unique delicately designed sample. An improved SAEM coupled with MCMC is 
discussed in Kuhn and Lavielle (2004), which also argues that only one sample is required 
in the E-step if an appropriate Markov transition kernel is also used. 


Consequently, we mimic these two ideas to design our MC integration with a single 
sample. Technically, if we omit the probability of tracking a source belonging to both 
inside and outside the network, latent variable Rn^t can be considered as binary vari¬ 
able as well. Then, we can use the posterior mean of the latent variable as the sample 
we have been looking for. Therefore, at the kth iteration of EM, the pseudo-sample is 
constructed via a Bayesian decision rule based on the burn-in posterior mean in Gibbs 
sampling, i.e. Xn,t = > 0-5} and A,t = E/=s+i > 0-5}- 

This means that a unique set (X,R) is sufficient to approximate that is to say, 

log(P(X, R, ?7)|Z, substitutes for Q|)Qem* trick applied on non-Gaussian vari¬ 

ables is not theoretically guaranteed but has been broadly used in EM or other optimization 
problems, by assuming a fully factorized joint distribution. In our binary variable case we 
found that it made no significant difference on accuracy whenever this trick is applied, in 
practice. 

(k) 

Optimization To optimize fhe kth M-step, the update formula by the Newton- 

Raphson Method is briefly outlined in this paragraph, excluding the analytical gradient G 
and Hessian H computation. For efficiency, we update parameters as follows, with a few 
iterations. 


(k) _ (k) XU—1 

'^^EM-.new ~ '^^EMiold ~ 


G 


where *EM varies according to different estimators, bGEM or SEM. It is unnecessary for 
there to be complete convergence in order to guarantee A similar 

idea with a single iteration is mentione d in [Lange ( 1995[ ). The step size S ensures that the 
Wolfe conditions (Nocedal and Wright ( 2006[ )) are satisfied. The intuition in adding in step 
size here is, compared with gradient descent, is that Newton’s Method tends to make more 
progress in the right direction of the local optima, due to the property of affine invariance. 
This probably leads to an update where the step size is too large, so it is better for stochastic 
algorithms to enlarge the search domain at first then shrink later. 


4.4 Short Discussion on Sigmoid link 

A sigmoid link function benefits from model simplicity and hiding the infection parameters 


without the necessity to integrate them out. The likelihood P(?7 ,X|Z) shown in (28) can 
thus be exactly computed. It means that we can estimate parameters throughout either 
standard SAEM by getting rid of latent variable R immediately, or bGEM by introducing 
R in E-step and a faster M-step by keeping X alone. 


N 


N T-1 


P{v) n X n n (28) 

n=l n=l t=l 


1 - {1 - a{z^r]a))il - a{z^r]b)) 
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4.5 Further Discussion on 


In the second biological interpretation of (the probability of infecting others), transi¬ 
tion function (^n,x^r.{n,n')eGt bccome dependent on a parameter set {/3^/ : n' G Sn^t}- 
Consequently, the posterior of each requires both a count number and source tracking 
(conceptually, this is like a ’’pointer” in the C programming language). However, the likeli¬ 
hood of the beta-exponential model can be simplified to integrate out these parameters due 
to the auxiliary variable as well, corresponding to a new approximate categorical dis¬ 
tribution, though P{Rn^t) in previous (20) actually aggregates the probability with respect 
to all equal /3^s. The new categorical distribution and its induced completed likelihood can 
be represented as follows. 


PiR f) ~ Categorical { _\ 

(29) 

(30) 


P(X,R,77|Z) = P(77)n 


' B{e-nVrA + Cn,1^0,e-nVr,2 + 

^(gzjr7,,i,gzjr7,,2) 


e"n^a,2 + + ^,0^0 S(e"n^M + 6^^ ^6,2 + 




B{e^nrib,i^e^nrib,2^ 


where takes the value {0,1, and 0 means there is an outside network source 

and other integers refer to specific infection in-network sources. The categorical distribution 
makes the beta prior for the infection parameters conjugate in the posterior. However, the 
integral for the likelihood is actually difficult and needs some tricks, especially for because 
of the source tracking (see Appendix B for details). Note the likelihood for sigmoid link can 
be derived analogously. The new count notations are listed below. 


Cn,Rn =0 — ^{Rn,t= 0 } _ 

Cn.RnT^O — '^n'eSn,t ^{Rn,t=n'} _ 

Cn,R=n = ^n'.UneS^, ^ ^R^, ^=n} _ 

Cn.R^n = ^n',t:neS^; ^ihRn',t=^} ^{a^n^t=Q^^n^t+l=Q}] 


5. Experimental Results 


In this section we illustrate, on simulated data, the performance of our approach, hGCH- 
MMs and the burn-in Gibbs EM algorithm on three datasets for the purposes of predicting 
the hidden infectious state matrix X, filling in missing data - observations Y, and inferring 
an individual’s physical condition based on parameter estimation. Further application on 
the public real world Social Evolution Dataset Madan et al. (2012) and our mobile apps 
survey dataset are also shown. 
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(a) Synthesized X 



(b) X by Sigmoid link 



(c) X by Beta-exp link 


Figure 7: Infection state prediction is visualized by a heatmap. For (b) and (c), the redder 
the point is, the larger the probability of getting infected is. 


5.1 Semi-Simulation Dataset 

5.1.1 Data Generation 

Differing from completely simulated data or a totally artificial setting, we employed a gener¬ 
ative model to synthesize X,Y based on the real dynamic social network Gt and covariates 
Z from the real Social Evolution dataset. The predefined X then plays the role of ground 
truth, making evaluation for all above points possible. 

Real Data The public MIT Social Evolution dataset contains dynamic networks G84x84xi07 
including 84 participants over 107 days, Gt indexes 1 to 107, and covariates Zn exist for 
each participant, including 9 features, weight, height, salads per week, veggies fruits per 
day, healthy diet level, aerobics per week, sports per week, smoking indicator, and default 
feature 1. The quantity per week is frequency. Weight and height are taken as real values. 
Healthy diet includes 6 levels ranging from very unhealthy to very healthy based on self 
evaluation. The smoking indicator is literally a binary variable. Real symptoms Y are 
temporarily discarded since the true infection states X are unavailable for this dataset. 

Synthesized Data X and Y are then generated based on our proposed generative 
model. Hyperparamter r] needs to be predefined, which means synthesized infection pa¬ 
rameters ^rc known because of the sigmoid function for data simulation. Only 

synthesized data Y with 6 symptoms is given to the model, but the evaluation is done on 
other variables. The proportion of missing values in Y is set to 0.5, i.e, the observations 
yn,t,s are NA with probability 0.5. Our generated X84xi08 (including initial states at times¬ 
tamp 0) is shown in Figure [^a). Each row vector represents a person’s infection states 
during the entire observation period. 

5.1.2 Model Evaluation 

We ran the algorithm 10 times. The prediction performance on latent variable X is the 
byproduct of the E-step, and when Xn^t is larger than the threshold 0.5 the person is 
diagnosed as being infected. Since X is completely unknown to the algorithm, held-out 
test data prediction is unnecessary, while the whole matrix X is used to evaluate prediction 
accuracy. Figure [^a-c) shows the difference between the truth and the inferred results for 
each of the two linked models. The posterior mean from Gibbs sampling for prediction. 
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Prediction accuracy on infection states X 


Norm of error vector: gamma 



Beta-exp link Sigmoid link Gibbs logreg 

(a) 

Norm of error vector: alpha 



Beta-exp link Sigmoid link Gibbs logreg 

(c) 
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Beta-exp link Sigmoid link Gibbs logreg 

(d) 


Figure 8: (a) Accuracy comparison on latent variable prediction; (b-d) Error bar compar¬ 
ison on parameter estimation; Notice the A/^-dimensional error vector is normed 
to a scalar for statistical comparison. 


in both beta-exponential and sigmoid models, leads to a real value in the interval [0,1]. 
Figure a) reveals a quantitative measurement on accuracy with a standard deviation. 
As mentioned before, the baseline model is a two-step algorithm including the standard 
GCHMM and logistic regression is also implemented and compared to. The rightmost 
barplot in Figure |^a) shows its predictive performance. The GCHMM needs to run at 
least 2000 iterations of Gibbs sampling to obtain good mixing, while in our approach, we 
only run about 50 inner Gibbs sampling iterations in E step and less than 10 outer EM 
iterations. 
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(a) 7n 


(b) an 




(c) 


(d) PCA Justification 


Figure 9: Scatter plot illustration of (e-g) Horizontal axis: estimation; Vertical axis: ground 
truth.Colinearity elimination on 7^: PCA justification 7 is to obtain the regressed 
slope close to 1. 


Figure l^b-d) displays the predictive error of the forecasted infection parameters. Since 
the infection parameters are individual specific, the estimation is in fact a vector of length 
N. Therefore we used the 2-norm of the error vector for statistic summarization and further 
comparison. It is apparent that the sigmoid model gives the best performance on latents X 
or 7n, <^715/3n5 in terms of the generative model. The defined Beta-exponential Model, as a 
probabilistic substitute to the approximate sigmoid transformation on infection parameters, 
proves it is competitive for parameter estimation. However, standard GCHMM with logistic 
regression, as two independent parts of the sigmoid model, provides an unreliable prediction 
on individual-specific parameters, albeit its excellent latent variable inference. All three 
inference methods use Gibbs sampling to infer the posterior mean of X. This is most likely 
the reason why they share similar performance. 

5.1.3 Individual Parameter Analysis 

From the perspective of general health care or disease control for large populations, 77 is of 
concern (discussion on a real biological dataset later). However, as for individual treatment 
and personal medical advice, are more significant for physical health. Better 
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(a) y .,5 (b) y .,5 (c) Predicted X 


Figure 10: Epidemic state inference on real Data: (a) shows the true reported symptoms 
by 84 participants at day 5; (b) gives the one step ahead prediction of (a); (c) 
is predicted infection. 


immunity usually indicates a smaller but a larger 7 ^. In our model, these parameters 

are designed to correlate with personal health habits by using a link with influence coefficient 
T], The prediction of the infection parameters on raw data Z is shown in Figure [^a-c). This 
illustration is consistent with the error bar plot in Figure[^b-d). The predicted values of our 
proposed models are distributed with higher concentration on the diagonal of 2D-coordinate 
plane, while standard GCHMM + logistic regression has relatively larger variance. The 
underlying linear slope for 7 ^ seems inconsistent with the diagonal. This phenomenon can 
be blamed on the colinearity of Z. Looking at the names of the covariates (BMfl weight, 
height, salads per week, veggies and fruits per week, healthy diet or not, aerobic per week, 
sports per week, smoking or not), correlation obviously exists. Thus, we apply Principal 
component analysis (PCA) on Z, and then select the first 4 main components (explanatory 
power 99.9%) and the default feature 1. We next obtain the scatter plot of PCA justified 
7 ^ in Figure [^d). Results imply that PCA can eliminate colinearity effectively; however, 
the interpretability may not as effortless as Figure [^a). 


5.2 MIT Social Evolution Dataset 


This real world dataset Madan et al. (2012) is collected from a college dormitory building by 


web survey and contains the dynamic graphs G, covariates Z, and daily symptoms Y, where 
yn^t is a 6 dimensional vector including sore throat and cough, runny nose, congestion and 
sneezing, fever, nausea, vomiting and diarrhea, sadness and depression, and stress. The 
proportion of missing values Y is about 0.6. The purpose is to infer latent variables X 
and infection parameters, and making tentative health suggestions to students. Even if we 
cannot evaluate the performance on the true Xs, the Coogle search of ’’flu” [Dong et al. 
( | 2012 | ) implies a underlying correlation with this result. Since Y can be partially observed 
(no NA), one step ahead prediction on Y is possible, and obtains an accuracy at 92.09% 
(threshold is also 0.5 for binary indicator). Results are shown in Figure [To} 


2. Body mass index= p-, where m is mass/kg and h is heigh/m. 
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Figure 11: Overall Social Network 


5.3 eX-Flu Dataset 

Evaluation on the public MIT dataset seems only partially useful, since true diagnoses are 
unavailable. We describe the design, study population characteristics, and social network 
structure of a chain referral sample of 590 students living in University of Michigan residence 
halls who were randomized to an intervention of isolation over a 10-week period during the 
2013 influenza season. In our experiment, diagnoses are recorded immediately at flu onset. 


5.3.1 Design Description 


590 students living in six eligible residence halls on the University of Michigan campus 
enrolled in the eX-FLU study during a chain referral recruitment process carried out from 
September 2012-January 2013. 262 of these, as ’’seed” participants, nominated their social 
relations to join the study as well. The rest, 328, were nominees that enrolled. Participants 
have to fill out weekly surveys on web apps about their health behaviors and social interac¬ 
tions with other participants, and a symptom indicator report of influenza-like illness (ILI). 
A subsample of 103 students were provided with smartphones with a mobile application, iEpi 


Knowles et al. (2014), which is able to collect location sensor and contextually-dependent 


survey information, implying social contacts that are used in our computational model. 
This sub study experiment perfectly fits our proposed model, so the main evaluation will be 
performed on this sub dataset. Generally speaking, the underlying cumulative distribution 
of degrees for the overall social network on 590 students is shown in Figure [T^a). The 
distribution of three degree measurements (in, out, or total), were heavily right-skewed and 
over-dispersed. Consequently, the network appears scale-free, with a log-log plot shown in 
Figure [Tl](b) and a linear trend line {R^ = 0.91) illustrating the approximately power-law 
distribution for total degree. 
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Day 2 Day 27 




Figure 12: (a) iEpi Bluetooth network (N=103): Network of Bluetooth contacts between 
smartphones in the iEpi sub-study; (b) Sampled dynamic social networks from 
(a): 103 dots uniformly distributed as a large circle. Contacts within the network 
account for edges between solid dots. 


5.3.2 iEpi Sub-Study and Networks Analysis 

103 (17.5%) students of the 590 enrolled participants were equipped with provided smart¬ 
phones and joined the iEpi sub-study. They were required to use their iEpi smartphone and 
could report their symptoms, meeting the study criteria for ILL A total of 4843 contextually- 
based surveys were administered on all sub-study smartphones (mean 62.09/day), 1743 
(36.0%) of which were responded to by iEpi sub-study participants (mean 22.35/day). There 
were a total of 60131 Bluetooth contacts between smartphones within the iEpi sub-study, 
and 148,333 total Bluetooth contacts with other devices of any kind, averaging 7.48 con- 
tacts/phone/day and 20.95 contacts/person/day, respectively. 

The bluetooth detector can automatically collect contacts occurring between iEpi in¬ 
stalled smartphones, or to other smart devices. Each node (circle) in Figure [T^ a) represents 
an individual in the sub-study, and the links (edges) between nodes represent bluetooth de¬ 
tections between smartphones of individuals within the sub-study networks. Node size is 
proportional to the total number of contacts detected in the bluetooth data (equivalent to 
degree), and the link thickness indicates the contact duration between the two nodes (equiv¬ 
alent to weight on edge). During the experiment period, we also conducted a comparison 
test. Some participants (yellow nodes in Figure [T^a)) were isolated for three days at the 
onset of illness, which means no social contacts were made during this period. 

The next step is to extract daily social networks. We use the 77 days of the iEpi 
survey data which is relatively complete, and its corresponding bluetooth data to construct 
dynamic networks. Figure [T^b) illustrates 4 independently sampled sub-networks, i.e. 
t = 2, 27, 52, 77. To make more sense of the edges, only the bluetooth data showing the total 
contact duration between two participants lasting more than 10 minutes will contribute to 
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Figure 13: Left is true Onset and its Duration. Right is predicted by Sigmoid Model. 

an edge on that day. The threshold of 10 minutes can be adjusted to make the graph denser 
or sparser, thus leading to a higher or lower computational cost. 


Table 2: exFlu Epidemics State Inference Performance 


Model 

Recall 

Accuracy 

Sigmoid link 
Beta-exp link 
GCHMMs+LogReg 

0.8974 ± 0.00 
0.7436 ± 0.00 

0.7436 ± 0.00 

0.9978 ± 0.00 
0.9912 ± 0.00 

0.9912 ± 0.00 


5.3.3 iEpi Elu Diffusion Analysis 

Available illness onset diagnoses in our experiment allows for the evaluation of inferred 
infection states. We tried all three models, Sigmoid link, Beta-exponential link and stan¬ 
dard GCHMMs-hLogReg on this dataset. Because of the specific quantized distribution of 
diagnosed flu onset (see red short pattern of left graph in Figure [T^), the three methods 
perform stably, but give different results over 10 runs with no standard deviation. Though 
they all heavily rely on Gibbs sampling, the sigmoid link model can detect more short term 
patterns than the other two. Table gives both precision and recall for prediction, since 
the proportion of positive instances, unlike our simulation, is about one tenth. Even the 
sigmoid model missed some very short patterns. Two reasons may contribute to this phe¬ 
nomena; the first is that HMMs are a long distance dependent models; the second is that 
we find symptom reports for short period disease courses are always low severity. 

In contrast to other models, and serving as the mainstay and novelty of this paper, 
we aimed to learn how personal features (first column in Table were associated with 
individual flu vulnerability, i.e. coefficients 77 . A Sigmoid transform on 77 will immediately 
give infection parameters. Larger 7 ^ implies better resistance, while larger an^ indicates 
increased vulnerability. Because resistance or vulnerability is not an experimental quantity 
(difficult to measure in a real world dataset), we prefer to evaluate coefficients 77 (Table 
rather than actual infection parameters. The right three columns are the estimated 77 s 
associated with different biological meanings (indicated by their subscripts) in the Sigmoid 
model-possessing the best performance in both the simulation and real cases. Looking at 
the feature column, we can see that females seems suffer from a slower recovery but are 

3. Gender 1 means female; Alc_Day: average number of times hand washing with sanitizer; Vacc_Ever: 
previous vaccination; Flushot_Yr: vaccination this year; Act_Days: exercise in broad sense per day; 
Wash_Opt: whether wash hands exceeding 20s; High_Risk: contact with impaired immunity patient. 
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Table 3: Coefficients Estimation on exFlu Dataset 


Feature" 


Recovery r]r 

Outside Infect r]a 

Inside Infect 775 

Default= 

1 

-1.3022 ± 0.0146 

-5.1517 ± 0.0024 

-4.1619 ± 0.0281 

Gender 


-0.1575 ± 0.0118 

-0.2428 ± 0.0074 

-0.1457 ± 0.0078 

Age 


0.0074 ± 0.0082 

-0.2376 ± 0.0051 

-0.0181 ± 0.0017 

Alc_Day 


0.1090 ± 0.0078 

-0.1534 ± 0.0003 

-0.0410 ± 0.0018 

Vacc_Ever 

-0.0698 ± 0.0104 

0.1092 ± 0.0095 

0.0382 ± 0.0085 

Flushot_Yr 

0.0769 ± 0.0092 

-0.3209 ± 0.0073 

0.0837 ± 0.0055 

Smoker 


-0.1080 ± 0.0029 

-0.0536 ± 0.0008 

0.0773 ± 0.0021 

Drinker 


-0.1335 ± 0.0092 

0.0628 ± 0.0030 

0.1408 ± 0.0029 

Act _D ays 

0.0356 ± 0.0099 

0.0054 ± 0.0063 

-0.0622 ± 0.0078 

Sleep_Qual 

0.0225 ± 0.0069 

-0.3686 ± 0.0051 

-0.0162 ± 0.0077 

Wash_Opt 

0.0024 ± 0.0103 

0.0816 ± 0.0132 

-0.0714 ± 0.0048 

High Risk 

-0.1274 ± 0.0116 

-0.1252 ± 0.0058 

-0.0727 ± 0.0007 


not as likely to catch a cold. Another important factor is whether participants are addicted 
to alcohol. Drinkers significantly aggravate body immunity. However, whether or not one 
washes their hands for more than 20 s, interestingly, seems not to be significant to the model, 
especially to the recovery rate. This may be blamed on an overly long washing duration-20s 
in the experimental design. Overall, the sign consistency with respect to r] makes sense, 
with the exception of a few relationships. For the sigmoid function, positive coefficients will 
enlarge infection parameters, and vice versa. 


6. Conclusion and Discussion 


We successfully extend the HMM based epidemic infection model to general and flexi¬ 
ble hierarchical GCHMMs, enabling us to simultaneously predict individual infection and 
physical constitution by observing how flu spreads throughout dynamic social networks. 
In addition, this framework can reduce to previous models via simplifying the graphical 
model if less information is obtained, e.g. The model without a social network data will 
reduce to a standard HMM; sharing the same infection parameters will lead to a homoge¬ 
nous model. The hypothesis test on the necessity of social network has been researched 


by Salathe et al. (2010); Dong et al. (2012). The heterogeneity induced in our approach is 


validated on semi-synthetic data and epidemiological tracking data in college dormitories. 


based on prediction compared with previous work Salathe et al. (2010); Dong et al. ( 2012 ). 


On semi-simulation data, we evaluate our model on a number of metrics, including on the 
ability to correctly infer parameters. On the MIT social evolution data, we mainly focus on 
one step ahead prediction of the observed states (or symptoms). In our eX-Flu study, we 
successfully discovered the underlying social network pattern and personal feature relation¬ 
ships with respect to influenza vulnerability. In fact, the prediction on the infection state 
of the next day beyond the surveyed period is also available due to the property of Markov 
chain. From an application perspective, this is quite promising for health app development. 

Future research will explore learning a robust dynamic network generative model, since 
the bluetooth data can provide the contact duration which is a positive real value. Currently, 
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the binary indicator determined by hard threshold is not fine-grained. One possibility is 
to model stochastic processes for each edge, such as constructing a Hawkes process 


model with a latent space model, which requires combining the work of Blundell et al. 


(2012); Palla et al.| (2012). Alternatively, we may explore constructing features based on 
personal history (e.g., the contact frequency or duration between two nodes), and using this 


history feature as a predictor for the future Poisson rate Gunawardana et al.l (2011); Lian 


et al. (2015). Another possible area of future research would be to implement remark (3) 


or investigate the identifiability of the auxiliary variable, and infection network learning by 
detecting the disease spread path. We can further relax the heterogeneity assumption to a 
cluster assumption, i.e. participants with similar health feature share the same parameters. 
Inspired by HDP-HMMs Teh et al. (2006), this tradeoff can be realized by constructing a 
nonparametric version GCHMMs, enforcing similar HMMs to share the same parameters. 
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Appendix A. Update of Message Passing 
Updating A 
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Boundary Conditions 
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• Evidence nodes, i.e. yn^t^s 
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Appendix B. On Likelihood computation for 2nd Interpretation on ( 5 ^ 


Notice when to obtain (30), we use the following identity. 
N T-1 


n=l t=l n'eSn,t 

N T -1 N .. . 


n=l t=l n'=l 
N T -1 N 


T=1 t=l n=l 

fi n n ((1 - 
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